About the project

I am feeling curious, excited and a bit nervous as well. I expect to learn about programming and data analysis in R langauage, so far I have been using spss. I want to also learn how to analyse big data and find patterns in data that could lead to new findings. I found the course through UEF WebOodi.

R markdown

Formatting text

I have written this portion where you can find
  1. textwritten in different formats
  2. clickable links
image

image

Here is the link to my GitHUB repository https://github.com/saranglq/IODS-project

Here is the link to my diary page https://saranglq.github.io/IODS-project/

Another link to my diary page


RStudio Exercise 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

I started this week with learning R through the Data Camp platform. I learned how to handle (wrangle) data in R. I learned how to modify existing datasets, merge different datasets, and carry out simple and multiple linear regressin analyses. I also learned how to make graphical representations of these analyses.

Dataset

The dataset given to us was from the “International survey of Approaches to Learning” funded by the Teacher’s Academy funding for Kimma Vehkalahti. Data was collected during 2013-2015

Oridinal dataset had 183 observations for 60 variables. Variables with prefix A adn C were based on componenets of A and C parts of ASSIST (Approaches and Study Skills Inventory for Students). Prefix D variables were based on SATS (Survey of Attitudes Toward Statistics). Items from the ASSIST B part were named so that their connection to the relevant dimension (Deep/Surface/Strategic) was maintained

Data wrangling

The dataset had multiple variables and I was supposed to merge assist B components into three major variables (Deep, Surface, Strategy). I also renamed three other variables (Age Attitude, Points -> age, attitude, points). Then I had to exclude other variables after selecting only 7 variables. The resulting dataset I saved as lrn2014.csv which has 7 variables and 166 observations for each variable.

Data distribution

Gender distribution was uneven, there were 110 female and 56 male subjects.

Distributions of each variable (except gender) are given below:

drawing drawing drawing drawing drawing drawing

Following is a summary table for all variables:

Variable Mean Median Min Max
Age 25.51 2.83 17 55
Attitude 31.43 32 14 50
Deep 3.68 3.18 1.58 4.91
Strategy 3.12 3.18 1.25 5.00
Surface 2.78 2.83 1.58 4.33
Points 22.72 23.00 7.00 33.00

Regression

I first carried out an exploratory analysis to find out which variables correlate best with points:

drawing

Attitude, strategy and surface had the strongest corellation and therefore were chosen as the predictor variables

I ran a multiple linear regression model and got following results

drawing

According to the results, attitude is the only variable that has significant linear relationship with points (p < 0.001). Each 1 point increase in attitude results in an increase of 0.34 in points.

Non-significant variables were removed and model was fitted again.

drawing

drawing

According to the final model, attitude has still a significant linear relationship with points, and each 1 point increase in attitude results in an increase of 0.35 in points. Multiple R-squared value is 0.19 which means that attitude explains 19 percent of variance in points.

Assumptions of the multiple linear model are that the

Errors are normally distributed Errors are not correlated Errors have constant variance The size of a given error does ot depend oin the explanatory variables

Diagnostic plots were created to check these assumptions

drawing

QQ-plot demostrates that the errors of the model are normally distributed, there are howeevr values at extremities that deviate from the normal line.

The scatterplot of residuals vs the fitted values shows no pattern, i.e. the spread remains largely similar with the increase in fitted value. Howeverthere are some residuals that are far form the fitted line. These could potentially be outliers

The residuals vs leverage plot shows that none of the eobservations have an unusually high impact on the results.


RStudio Exercise 3: Logistic regression

The current dataset is called “Student Performance Data Set” and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Student+Performance. Data compares student performance in two subjects: Portuguese and Mathematics.G1 referes to grades received during the first period and similary G2 and G3 correspond to grades received in second and third periods.

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)

After data wrangling the dataset was saved as alc.csv which contains 35 variable and 182 observations

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
## [1] 382  35

Hypotheses

I would be interested in examining the relationship between alcohol use and:

  1. Parental cohabitation: I think that tense family relationships can lead to a higher stress in teenagers and therfore a tendency towards more consumption of alcohol.

  2. Mother’s education: Research shows that higher education for women results in multipe positive outcomes for family and society. This could also reflect in the attitudes of children.

  3. Romantic relationships: Failure to bond during teenage years and peer pressure to be in a relationship can result in higher tendency to drink.

  4. Current health status: It might be a causal factor for drinking but also vice versa i.e. excessive drinking could be related to bad health status.

Distribution of variables

Parental cohabitation and romantic relationship are binomial variables. Mother’s eductation and health are categporical, ordinal variables. The bar charts showing their distributions are presented below.

The charts show the distribution of subects based on their relationship status, parental cohabitation and usage of alcohol.

## # A tibble: 4 x 3
## # Groups:   Pstatus [2]
##   Pstatus high_use count
##   <fct>   <lgl>    <int>
## 1 A       FALSE       26
## 2 A       TRUE        12
## 3 T       FALSE      242
## 4 T       TRUE       102
## # A tibble: 4 x 3
## # Groups:   romantic [2]
##   romantic high_use count
##   <fct>    <lgl>    <int>
## 1 no       FALSE      180
## 2 no       TRUE        81
## 3 yes      FALSE       88
## 4 yes      TRUE        33

Logistic regression

None of the chosen variables are significantly associated with high alchohol use. The variables and their resulting odds ratios for high alcohol use are presented along with the confidence intervals. Odds ratios along with confidence intervals for association of each variable with high alcohol use are also presented,

## 
## Call:
## glm(formula = high_use ~ Pstatus + Medu + romantic + health, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4627  -0.8821  -0.7091   1.2911   2.0228  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.10674    1.32994   0.080    0.936
## PstatusT    -0.08875    0.38587  -0.230    0.818
## Medu1       -0.91856    1.27172  -0.722    0.470
## Medu2       -1.90163    1.26062  -1.508    0.131
## Medu3       -0.91270    1.25197  -0.729    0.466
## Medu4       -1.27811    1.24739  -1.025    0.306
## romanticyes -0.15946    0.25057  -0.636    0.525
## health2      0.76463    0.48160   1.588    0.112
## health3      0.13564    0.44268   0.306    0.759
## health4      0.32219    0.45049   0.715    0.474
## health5      0.63142    0.39548   1.597    0.110
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 448.12  on 371  degrees of freedom
## AIC: 470.12
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                    OR       2.5 %    97.5 %
## (Intercept) 1.1126471 0.086924876 27.053858
## PstatusT    0.9150754 0.437244590  2.007650
## Medu1       0.3990923 0.017558764  4.548870
## Medu2       0.1493248 0.006654147  1.665484
## Medu3       0.4014386 0.018078594  4.410220
## Medu4       0.2785625 0.012611116  3.033450
## romanticyes 0.8526065 0.517603874  1.385456
## health2     2.1482059 0.843454699  5.637405
## health3     1.1452705 0.486362057  2.792529
## health4     1.3801401 0.576722981  3.413775
## health5     1.8802695 0.888022503  4.233663

Choosing alternative variables

I chose study time, activities and going out with friends. Distributions of these variables are presented through bar charts.

Based on the distributions I chose to explore gender, study time, activities and going out. Gender and going out were significantly associated with high alcohol use. Study time also had a significant but weak association.

Males were more likely to drink heavily (OR 2.24 CI95% 1.32 - 3.84). Compared to the reference group, students who went out most frequently were more likely to consume alcohol heavily (OR 10.73 CI 95% 3.036 - 51.42)

## 
## Call:
## glm(formula = high_use ~ sex + studytime + activities + goout, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7617  -0.7725  -0.5228   0.8228   2.5617  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.8672     0.6722  -2.778 0.005474 ** 
## sexM            0.8073     0.2721   2.966 0.003012 ** 
## studytime2     -0.3456     0.2970  -1.164 0.244524    
## studytime3     -0.9712     0.4809  -2.019 0.043455 *  
## studytime4     -1.1241     0.6298  -1.785 0.074310 .  
## activitiesyes  -0.4044     0.2590  -1.561 0.118454    
## goout2          0.3503     0.6956   0.504 0.614510    
## goout3          0.5243     0.6813   0.770 0.441507    
## goout4          2.0665     0.6816   3.032 0.002430 ** 
## goout5          2.3735     0.7020   3.381 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 384.35  on 372  degrees of freedom
## AIC: 404.35
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                       OR      2.5 %     97.5 %
## (Intercept)    0.1545556 0.03368531  0.5095379
## sexM           2.2418487 1.32119716  3.8496160
## studytime2     0.7077842 0.39492190  1.2688956
## studytime3     0.3786406 0.14036906  0.9407166
## studytime4     0.3249548 0.08308952  1.0313723
## activitiesyes  0.6673741 0.39987596  1.1063789
## goout2         1.4195109 0.40277427  6.7023293
## goout3         1.6893594 0.49743864  7.8225116
## goout4         7.8972057 2.33914417 36.6844868
## goout5        10.7354006 3.03673066 51.4240176

Predictions using the model

Previously significant variables (sex, studying time, going out) were chosen for this model. The chart represents the high/non “high users” that were users that were classified correctly by the model as high users that were classified correctly or incorrectly. The tabulated chart shows the proportion of subjects classified correctly and incorrectly by the model.

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65183246 0.04973822 0.70157068
##    TRUE  0.15445026 0.14397906 0.29842932
##    Sum   0.80628272 0.19371728 1.00000000

Loss function of the training model was:

## [1] 0.2041885

If a simple guessing strategy would be employed, there would be a 50 percent change of getting a correct answer since the outcome is a binomial outcome. The model however is more efffective as it gives a correct result 80 percent of the time, as demonstrated through the training set.

Bonus exercise

## [1] 0.2251309

The 10 fold cross validation test resulted in a 0.22 error for my model, which is less than 0.26 for the model introduced in DataCamp.

Super bonus

Let us fit all the variables into a logistic regression model.

##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "nursery"     "internet"    "guardian"    "traveltime" 
## [16] "studytime"   "failures"    "schoolsup"   "famsup"      "paid"       
## [21] "activities"  "higher"      "romantic"    "famrel"      "freetime"   
## [26] "goout"       "Dalc"        "Walc"        "health"      "absences"   
## [31] "G1"          "G2"          "G3"          "alc_use"     "high_use"   
## [36] "probability" "prediction"
## 
## Call:
## glm(formula = high_use ~ school + sex + studytime + goout + Pstatus + 
##     Medu + Fedu + Mjob + Fjob + reason + nursery + internet + 
##     guardian + traveltime + studytime + failures + schoolsup + 
##     famsup + paid + activities + higher + romantic + famrel + 
##     freetime + goout + health + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9372  -0.6024  -0.2891   0.3696   2.9525  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -16.00788  998.11272  -0.016 0.987204    
## schoolMS          -0.31881    0.56801  -0.561 0.574610    
## sexM               1.14045    0.36391   3.134 0.001725 ** 
## studytime2        -0.10816    0.39602  -0.273 0.784764    
## studytime3        -0.58458    0.62948  -0.929 0.353053    
## studytime4        -0.51937    0.81835  -0.635 0.525648    
## goout2             0.35067    0.92424   0.379 0.704381    
## goout3             0.39244    0.89417   0.439 0.660744    
## goout4             2.48140    0.90743   2.735 0.006247 ** 
## goout5             2.57466    0.94753   2.717 0.006583 ** 
## PstatusT          -0.64279    0.57431  -1.119 0.263036    
## Medu1             -1.46770    1.63389  -0.898 0.369033    
## Medu2             -2.29763    1.63061  -1.409 0.158818    
## Medu3             -0.97329    1.65010  -0.590 0.555302    
## Medu4             -1.11258    1.71722  -0.648 0.517053    
## Fedu1             13.12642  998.10935   0.013 0.989507    
## Fedu2             13.35623  998.10933   0.013 0.989323    
## Fedu3             13.14915  998.10933   0.013 0.989489    
## Fedu4             13.46401  998.10938   0.013 0.989237    
## Mjobhealth        -1.49344    0.86167  -1.733 0.083061 .  
## Mjobother         -0.81758    0.57177  -1.430 0.152744    
## Mjobservices      -1.17488    0.64412  -1.824 0.068150 .  
## Mjobteacher       -0.54941    0.80208  -0.685 0.493354    
## Fjobhealth         0.51742    1.31546   0.393 0.694071    
## Fjobother          1.19438    0.99984   1.195 0.232255    
## Fjobservices       1.97809    1.02534   1.929 0.053705 .  
## Fjobteacher       -0.58137    1.19604  -0.486 0.626915    
## reasonhome         0.58678    0.43925   1.336 0.181590    
## reasonother        1.59275    0.60414   2.636 0.008379 ** 
## reasonreputation   0.12521    0.45818   0.273 0.784633    
## nurseryyes        -0.38504    0.41007  -0.939 0.347750    
## internetyes       -0.18085    0.47879  -0.378 0.705634    
## guardianmother    -0.65645    0.39486  -1.662 0.096413 .  
## guardianother     -0.70067    0.94726  -0.740 0.459495    
## traveltime2       -0.30249    0.38375  -0.788 0.430558    
## traveltime3        1.91997    0.80682   2.380 0.017328 *  
## traveltime4        3.26053    1.12585   2.896 0.003779 ** 
## failures           0.17111    0.29453   0.581 0.561252    
## schoolsupyes      -0.35779    0.50971  -0.702 0.482709    
## famsupyes         -0.08541    0.35824  -0.238 0.811558    
## paidyes            0.67650    0.35640   1.898 0.057673 .  
## activitiesyes     -0.35070    0.34488  -1.017 0.309210    
## higheryes         -0.11162    0.78940  -0.141 0.887556    
## romanticyes       -0.45218    0.36731  -1.231 0.218300    
## famrel2            0.67295    1.40011   0.481 0.630773    
## famrel3            0.63499    1.27185   0.499 0.617593    
## famrel4           -0.54507    1.24670  -0.437 0.661956    
## famrel5           -1.36245    1.26822  -1.074 0.282688    
## freetime2          1.02423    1.12818   0.908 0.363952    
## freetime3          1.57764    1.09818   1.437 0.150831    
## freetime4          2.01108    1.12377   1.790 0.073521 .  
## freetime5          1.50294    1.19980   1.253 0.210329    
## health2            1.09810    0.72431   1.516 0.129504    
## health3            0.50881    0.68053   0.748 0.454660    
## health4            0.74525    0.68633   1.086 0.277547    
## health5            0.97885    0.61138   1.601 0.109368    
## absences           0.10369    0.03037   3.415 0.000639 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 292.37  on 325  degrees of freedom
## AIC: 406.37
## 
## Number of Fisher Scoring iterations: 14
## [1] 0.1910995
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] 0.2565445

The training error of this model was 0.19 and the tesing error was 0.24

The significant variables were selected and another model was created.

## 
## Call:
## glm(formula = high_use ~ sex + goout + reason + traveltime + 
##     absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7762  -0.7111  -0.4500   0.6284   2.5265  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.31613    0.76071  -4.359 1.31e-05 ***
## sexM              0.97063    0.27099   3.582 0.000341 ***
## goout2            0.49127    0.74900   0.656 0.511890    
## goout3            0.54824    0.73191   0.749 0.453824    
## goout4            2.23550    0.73810   3.029 0.002456 ** 
## goout5            2.35217    0.75767   3.104 0.001906 ** 
## reasonhome        0.36755    0.33529   1.096 0.272983    
## reasonother       1.15017    0.46511   2.473 0.013403 *  
## reasonreputation -0.16018    0.35498  -0.451 0.651819    
## traveltime2      -0.11500    0.30300  -0.380 0.704275    
## traveltime3       1.47622    0.57538   2.566 0.010298 *  
## traveltime4       1.87986    0.92470   2.033 0.042057 *  
## absences          0.08835    0.02377   3.717 0.000202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 360.59  on 369  degrees of freedom
## AIC: 386.59
## 
## Number of Fisher Scoring iterations: 4
## [1] 0.1884817
## [1] 0.2198953

This model had a better performance on the testing set, i.e. testing error was 0.21. Finally I decided to remove reason (to choose school) and travel time variables (because they were weak predictors and the category other for reason variable was hard to determine).

## 
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7601  -0.7278  -0.4717   0.7663   2.2870  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.89554    0.71197  -4.067 4.76e-05 ***
## sexM         1.02291    0.26103   3.919 8.90e-05 ***
## goout2       0.35631    0.73254   0.486 0.626685    
## goout3       0.42736    0.71684   0.596 0.551053    
## goout4       1.99158    0.71679   2.778 0.005461 ** 
## goout5       2.25925    0.73593   3.070 0.002141 ** 
## absences     0.08396    0.02278   3.685 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 380.07  on 375  degrees of freedom
## AIC: 394.07
## 
## Number of Fisher Scoring iterations: 4
## [1] 0.2068063
## [1] 0.2094241

Final model had 3 predictors and a testing set loss function of 0.21.

##   n_variables training_error
## 1          27      0.1910995
## 2           5      0.1884817
## 3           3      0.2068063
##   n_variables testing_error
## 1          27     0.2565445
## 2           5     0.2198953
## 3           3     0.2094241


RStudio Exercise 4: Clustering and classification

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)

Now we shall load the dataset Boston from MASS package and explore it.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
dim(Boston)
## [1] 506  14

The data has 14 variables and 506 measurements. Each measurement corresponds to a suburb of Boston i.e. 506 suburbs in total for this dataset. The variables are described in detail at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Overview of data

Data contains 12 numeric variables and 2 integer variables.

The variables are
  1. Crime per capita: Left skewed.
  2. Proportion of res zone with lots of over 25000 sq. ft.: Left skewed
  3. Proportion of industrial zones: Left skewed with peak at 20%
  4. Bordering Charles river: About 40 towns
  5. NO concentration: Left skewed, from 0.4 to 0.9 PP10M
  6. Avg. number of rooms per dwelling: Normally distributed with peak at 6 to 7
  7. Proportion of owner-occupied units built prior to 1940: Right skewed.
  8. Mean of distances to 5 Boston employment centres: Left skewed. Range form 1 to 12.5 (km? miles?).
  9. Index of accessibility to highways: Range 1 to 24. Uneven, u-shaped distribution.
  10. Property tax rate per $10,000: Uneven distribution.
  11. Pupil to teacher ratio: 12 to 23, right skewed.
  12. Proportion (modified) of Black people by town: Right skewed. Range 0.32 - 396.9.
  13. Lower status of population (in %): Right skewed, peak at 5%.
  14. Median value of homes (owner occupied) in $1000s: Normally distributed, peak at 20.
varlabels = c("per capita crime rate","proportion of residential land zoned","proportion of industrial zones per town","towns bordering charles river N/Y","NO concentration, PP10M","average number of rooms per dwelling","proportion of units built pre-1940", "weighted means of distance to 5 Boston employment centres","index of accessibility to radial highways","property tax rate per $10,000","pupil to teacher ratio", "Proortion of blacks by town","lower status of population","median value of homes in $1000s")


selcols <- colnames(Boston)
selvars <- dplyr::select(Boston, one_of(selcols))
a=0

for(var in selvars) {
   
  if(is.integer(var)){
    a = a +1
    plot <-ggplot(Boston, aes(var, ..count..)) + geom_bar() + xlab(varlabels[a]) + ylab("Number of suburbs")
  print(plot)

} else { 
    a = a +1
    plot <- qplot(var,
      geom="histogram",
      binwidth = ((1/7.9) * sd(var)) * 3.49,  
      main = paste("Histogram for", varlabels[a])  ,  
      xlab = (varlabels[a]),  
      fill=I("blue"), 
      col=I("red"), 
      alpha=I(.2))
    print(plot)
    }
}

cor_matrix<-cor(Boston) 

corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The correlation matrix shows that the strongest positive correlations are between 1. Accessibility to highways and property tax rate, 2. Industrialization and NO concentration and 3. Number of rooms per dwelling and median value of homes. The strongest negative corellations are between 1. Proportion of buildings built prior to 1940 and distances to employment centres, 2. NO concentration and proportion of buildings built prior to 1940, 3. Proportion of industrial zones and distances to employment centres and 4 Percent lower status of population and median value of owner occupied homes.

Already many patterns can be drawn from this data. Accessibility to highways means higher development and could mean higher property tax rates. More industrial zones can explain higher NO concentrations. Number of rooms per dwelling could explain the median value of homes as larger homes are more expensive. Negative relationship between percent of lower status population and median value of homes shows that richer 0suburbs have wealthier people.

Standardizing the data

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

bins <- quantile(boston_scaled$crim)

label <- c("low", "med_low", "med_high", "high")

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = label)

boston_scaled <- dplyr::select(boston_scaled, -crim)

boston_scaled <- data.frame(boston_scaled, crime)

After standardizing the data, the mean of all variables has been centered towards the zero and the range has decreased, but still maintaining similar proportions between them.

Now we shall split the data into train and test set.

n <- nrow(boston_scaled)

ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

dim(train)
## [1] 404  14
dim(test)
## [1] 102  14

Linear discriminant analysis

Drawing the LDA bi-plot. Categorical crime rate variable was used as the target and all other variables were used as predictors.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)

lda.arrows(lda.fit, myscale = 3)

Testing the model on test dataset

correct_classes <- test$crime

test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
##           predicted
## correct    low med_low med_high high Sum
##   low       18       7        3    0  28
##   med_low    3      17        3    0  23
##   med_high   0       8       17    3  28
##   high       0       0        1   22  23
##   Sum       21      32       24   25 102

When testing the trained model on test dataset, the model predicted correctly 9 out of 23 low crime, 15 out of 26 med_low crime, 15 out of 23 med_high crime and 30 out of 30 high crime suburbs.

K-means clustering

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

dist_eu <- dist(Boston, method = "euclidean")
dist_man <- dist(Boston, method = 'manhattan')

km <-kmeans(boston_scaled, centers = 3)

pairs(boston_scaled[6:10], col = km$cluster)

Checking the optimum number of clusters and re-analyzing

set.seed(123)

k_max <- 10

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

qplot(x = 1:k_max, y = twcss, geom = 'line')

km <-kmeans(boston_scaled, centers = 4)

pairs(boston_scaled[6:10], col = km$cluster)

I could not find a clear drop in WCSS after two so I set the number of clusters at 3.

If we draw a pairs plot, we can see that we cannot separate the neighborhoods based on two variables. Some variables are good at separating 2 clusters, but almost no combination of two variables can separate three clusters in a good way.

Bonus

So why don’t we try separating the clusters based on LDA instead of pairs? I chose 4 clusters for this purpose because it gives a good speeration of data.

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

set.seed(123)

km <- kmeans(boston_scaled, centers = 4)

lda.fit <- lda(km$cluster~ ., data = boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(km$cluster)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 7)

According to the results from the LDA plot, proportion of black people in town, nitric oxide concentrations, industrialization, tax rate and crime rate were the biggest factors separating suburbs into four categories.

Higher crime rate clustered the suburbs in cluster one. Lesser (or Higher? unclear from the variable description and formula) proportion of black people clustered the towns in the opposite direction. NO concentration, industrialization and tax rate clustered suburbs towards left, i.e. clusters 1 and 2.It can also be seen that these zones were more accessible to radial highways.

superbonus

We predicted crime rate category using other variables and plotted it in 3D!

The individual points were colored in the first graph according to the crime rate category and in the second graph using k means clustering (data was clustered into 4 groups in order to match with four categories of crime rate).

Even though in the second 3D graph the data was clustered using all available data and not specifically to predict crime, note that it still does a pretty good job in separating suburbs into four groups according to crime rate.

#PLOTTING 3D GRAPH FOR TRAIN SET WITH CRIME

data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]

km <- kmeans(train, centers = 4)

bins <- quantile(train$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(train$crim, breaks = bins, include.lowest = TRUE, label = label)
train <- data.frame(train, crime)

lda.fit <- lda(crime ~ ., data = train)

classes <- as.numeric(train$crime)

model_predictors <- dplyr::select(train, -crime)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)

RStudio Exercise 5: Dimentionality reduction techniques

Loading the required libraries

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
library(stringr)
library(FactoMineR)

Data wrangling

Loading human.csv data and carrying out the required data wrangling.

## 'data.frame':    195 obs. of  19 variables:
##  $ hdirank     : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ country     : Factor w/ 195 levels "Afghanistan",..: 129 10 169 48 124 67 84 186 34 125 ...
##  $ hdi         : num  0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
##  $ lifeexp     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ eduexp      : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ meanedu     : num  12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
##  $ gni         : Factor w/ 194 levels "1,096","1,123",..: 166 135 156 139 140 137 127 154 134 117 ...
##  $ gniminusrank: int  5 17 6 11 9 11 16 3 11 23 ...
##  $ giirank     : int  1 2 3 4 5 6 6 8 9 9 ...
##  $ gii         : num  0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
##  $ matmor      : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adobir      : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ fparli      : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ secedf      : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ secedm      : num  96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
##  $ labf        : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ labm        : num  68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
##  $ secedfm     : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labfm       : num  0.891 0.819 0.825 0.884 0.829 ...
## [1] 195  19

The variables have been derived from two datasets. First is the data used to create Human Devlopment Index (HDI). HDI was developed by the UNDP. The Gender Inequality Index is also devloped by the UNDP. According to the UNDP website, It measures gender inequalities in three important aspects of human development—reproductive health, measured by maternal mortality ratio and adolescent birth rates; empowerment, measured by proportion of parliamentary seats occupied by females; and economic status, expressed as labour market participation. The variables we will be using from these dataset are:

country: The country being described hdirank: HDI rank, which tells what is the rank of the country in terms of Human development index hdi: The Human Devleopment Index score lifeexp: Life expectancy at birth eduexp: Expected years of schooling meanedu: Mean years of schooling gni: gross national income per capita gniminusrank: Gross national income per capita rank minus HDI rank giirank: Gender Inequality Index rank gii: The Gender Inequality Index score matmor: Maternal mortality ratio adobir: Adolescent birth rate fparli: Percent representation of women in parliament secedf: Percent population (women) with secondary education secedm: Percent population (men) with secondary education labf: Percent participation (of women) in labor force labm: Percent participation (of men) in labor force

Using the above variables, following additional variables were created

secedfm: Ratio between % women and % men with secondary education labfm: Ratio between % women and % men in labor force

Now let’s make the changes as instructed in the exercise 5, data wrangling part

##       secedfm     labfm eduexp lifeexp    gni matmor adobir fparli  comp
## 1   1.0072389 0.8908297   17.5    81.6  64992      4    7.8   39.6  TRUE
## 2   0.9968288 0.8189415   20.2    82.4  42261      6   12.1   30.5  TRUE
## 3   0.9834369 0.8251001   15.8    83.0  56431      6    1.9   28.5  TRUE
## 4   0.9886128 0.8840361   18.7    80.2  44025      5    5.1   38.0  TRUE
## 5   0.9690608 0.8286119   17.9    81.6  45435      6    6.2   36.9  TRUE
## 6   0.9927835 0.8072289   16.5    80.9  43919      7    3.8   36.9  TRUE
## 7   1.0241730 0.7797357   18.6    80.9  39568      9    8.2   19.9  TRUE
## 8   1.0031646 0.8171263   16.5    79.1  52947     28   31.0   19.4  TRUE
## 9   1.0000000 0.8676056   15.9    82.0  42155     11   14.5   28.2  TRUE
## 10  0.9968520 0.8401084   19.2    81.8  32689      8   25.3   31.4  TRUE
## 11  0.9148148 0.7616580   15.4    83.0  76628      6    6.0   25.3  TRUE
## 12  0.9116162 0.7566372   15.6    84.0  53959     NA    3.3     NA FALSE
## 13         NA        NA   15.0    80.0  79851     NA     NA   20.0 FALSE
## 14  0.9908362 0.8880707   15.8    82.2  45636      4    6.5   43.6  TRUE
## 15  0.9989990 0.8107715   16.2    80.7  39267      8   25.8   23.5  TRUE
## 16  0.9934498 0.9108527   19.0    82.6  35182      4   11.5   41.3  TRUE
## 17  0.8641975 0.6948682   16.9    81.9  33890     27    2.2   16.3  TRUE
## 18  0.9667812 0.8379161   16.0    82.4  30676      2    7.8   22.5  TRUE
## 19  1.0000000 0.7848297   13.9    81.7  58711     11    8.3   28.3  TRUE
## 20  1.0139860 0.6931818   15.3    83.5  36927      6    5.4   11.6  TRUE
## 21  0.9348613 0.8010118   16.3    80.8  41187      6    6.7   42.4  TRUE
## 22  0.9375000 0.8230519   16.0    82.2  38056     12    5.7   25.7  TRUE
## 23  1.0000000 0.8064993   15.7    81.4  43869      4    4.1   30.3  TRUE
## 24  1.0000000 0.8703125   17.1    80.8  38695      4    9.2   42.5  TRUE
## 25  0.9775510 0.8275316   16.8    80.4  27852      7    0.6   27.7  TRUE
## 26  0.9138167 0.7978723   17.3    82.6  32045      4   10.6   38.0  TRUE
## 27  0.8844720 0.6655462   16.0    83.1  33030      4    4.0   30.1  TRUE
## 28  1.0020060 0.7481698   16.4    78.6  26660      5    4.9   18.9  TRUE
## 29  0.8880597 0.7072000   17.6    80.9  24524      5   11.9   21.0  TRUE
## 30  1.0000000 0.8156749   16.5    76.8  25214     11   16.8   19.8  TRUE
## 31  0.9424779 0.6985392   14.5    78.8  72570     27   23.0     NA FALSE
## 32  0.9302326 0.7876231   14.0    80.2  28633     10    5.5   12.5  TRUE
## 33  1.1305085 0.5319372   13.8    78.2 123124      6    9.5    0.0  TRUE
## 34  1.0040568        NA   13.5    81.3  43978     NA     NA   50.0 FALSE
## 35  0.9959799 0.7448980   15.1    76.3  25845      7   15.9   18.7  TRUE
## 36  0.9286550 0.7534669   15.5    77.4  23177      3   12.2   22.1  TRUE
## 37  0.9448568 0.8291233   16.4    73.3  24500     11   10.6   23.4  TRUE
## 38  0.8772379 0.5716440   14.4    80.6  27930      9   18.2   13.0  TRUE
## 39  0.8605974 0.2579821   16.3    74.3  52821     16   10.2   19.9  TRUE
## 40  0.9774306 0.6333333   17.9    76.3  22050     69   54.4   36.8  TRUE
## 41  1.1944444 0.5054348   13.3    77.0  60868      8   27.6   17.5  TRUE
## 42  0.9594241 0.6577540   15.2    81.7  21290     22   55.3   15.8  TRUE
## 43  0.9896266 0.8293051   16.3    80.9  25757      8   12.6   31.3  TRUE
## 44  0.9918946 0.7466667   15.4    75.2  22916     14   12.1   10.1  TRUE
## 45  1.1031128 0.4510932   14.4    76.6  38599     22   13.8   15.0  TRUE
## 46  0.9989899 0.8121302   15.2    74.2  22281     13   13.5   18.0  TRUE
## 47  0.9081197 0.7654110   14.8    77.3  19409     13   12.7   25.8  TRUE
## 48  0.9875666 0.5246691   14.7    74.4  83961     14   14.5    1.5  TRUE
## 49  0.8891235 0.7504363   15.2    76.2  14558      7   15.2   17.3  TRUE
## 50  0.9436009 0.7939778   15.7    71.3  16676      1   20.6   30.1  TRUE
## 51  0.9686486 0.7963738   14.7    70.1  22352     24   25.7   14.5  TRUE
## 52  0.8266200 0.3510896   13.6    76.8  34858     11   10.6    9.6  TRUE
## 53  0.9358696 0.7503852   14.2    74.7  18108     33   31.0   12.0  TRUE
## 54  1.0815109 0.7239583   15.5    77.2  19283     14   58.3   11.5  TRUE
## 55  1.0410959 0.8738966   12.6    75.4  21336     37   28.5   16.7  TRUE
## 56  0.9645749 0.8690629   15.0    69.4  20867     26   29.9   20.1  TRUE
## 57  1.0205245 0.8603133   15.4    75.6  12488     52   48.4   19.6  TRUE
## 58         NA        NA   14.0    76.1  20070     NA   49.3   25.7 FALSE
## 59  0.9717868 0.8118644   14.4    74.2  15596      5   35.9   20.4  TRUE
## 60         NA        NA   13.7    72.7  13496     NA     NA   10.3 FALSE
## 61  1.0821643 0.5990220   13.3    77.6  18192     85   78.5   19.3  TRUE
## 62  0.9130435 0.5880795   12.7    74.7  22762     29    5.7   14.2  TRUE
## 63  0.8517241 0.5876011   15.6    74.4  17470     73   30.9   11.6  TRUE
## 64  1.0045045        NA   13.4    73.1  23300     NA   56.3   43.8 FALSE
## 65  0.9802956 0.7019868   12.3    70.4  26090     84   34.8   24.7  TRUE
## 66  0.7934783 0.7307061   14.4    74.9  12190     16   16.9   34.0  TRUE
## 67  0.9428934 0.6200000   13.8    79.4   7301     80   43.1   48.9  TRUE
## 68  0.9566787 0.3286319   13.8    79.3  16509     16   12.0    3.1  TRUE
## 69  1.0039604 0.5898734   13.9    79.4  13413     38   60.8   33.3  TRUE
## 70  0.9201183 0.2255435   15.1    75.4  15440     23   31.6    3.1  TRUE
## 71  1.1141732 0.6452020   14.2    74.2  16159    110   83.2   17.0  TRUE
## 72  0.6500000 0.4152542   14.5    75.3  18677     20   30.9   14.4  TRUE
## 73  0.9515707 0.4600262   13.7    74.9   9779     29   16.9    5.8  TRUE
## 74  0.9191419 0.5644556   13.1    76.8  16056     49   63.4   37.1  TRUE
## 75  1.0419847 0.7351485   15.2    74.5  15175     69   70.8    9.6  TRUE
## 76  0.9676375 0.7523302   13.8    74.9   7164     41   46.8   11.3  TRUE
## 77         NA        NA   12.9    73.8  20805     NA     NA    6.7 FALSE
## 78  0.9620123 0.9037356   11.9    70.8  16428     26   40.0   15.6  TRUE
## 79         NA        NA   15.8    73.4  10939     23   35.4   25.0 FALSE
## 80  0.8853503 0.2342342   13.5    74.0  11365     50   26.5   11.6  TRUE
## 81  0.7230216 0.6385185   13.4    75.4  11780      7   18.3   33.3  TRUE
## 82  0.9562044 0.7952167   15.1    71.0   8178     23   25.7   11.8  TRUE
## 83  0.8612903 0.2105263   14.0    74.8  13054     89   10.0   25.7  TRUE
## 84  0.8517398 0.8080569   13.1    74.6  11015     89   50.7   22.3  TRUE
## 85  0.9306030 0.6854962   11.8    77.8   9943     21   15.3   20.7  TRUE
## 86  0.9894737 0.7465565   12.3    74.7   8124     29   27.1   10.7  TRUE
## 87  0.6432665 0.5951134   13.6    76.5   9638      8   15.1   19.3  TRUE
## 88  1.0177665 0.6614268   14.2    75.9  10605     87   77.0   41.6  TRUE
## 89         NA 0.8228346   12.6    75.1   9765     34   56.3   20.7 FALSE
## 90  0.8164117 0.8160920   13.1    75.8  12547     32    8.6   23.6  TRUE
## 91  0.9953488 0.5208333   15.7    70.0   7493     59   42.8   14.0  TRUE
## 92  1.0142687 0.8167388   14.6    69.4  10729     68   18.7   14.9  TRUE
## 93  0.8750000 0.7967782   13.5    74.4  13323     26   41.0    6.1  TRUE
## 94  1.2801724        NA   12.7    77.8   9994     NA     NA   21.9 FALSE
## 95  1.3245823 0.3926702   14.0    71.6  14911     15    2.5   16.0  TRUE
## 96  0.7114967 0.3540197   14.6    74.8  10404     46    4.6   31.3  TRUE
## 97  1.0233813 0.7001255   13.5    74.0  12040     83   68.5   20.9  TRUE
## 98         NA 0.7141026   13.4    72.9   9937     45   54.5   13.0 FALSE
## 99  1.0541311 0.7912553   12.4    75.7   7415     80   70.1   16.7  TRUE
## 100 0.9909400 0.7171582   14.7    72.8   5069    120   18.1    0.0  TRUE
## 101 1.0079156 0.5978129   13.6    70.0   7614     45   71.4   13.3  TRUE
## 102 1.0470810 0.6526718   13.1    73.5  11883    100   99.6   19.1  TRUE
## 103 0.9469214 0.5886628   12.7    71.1  15617    130   35.2   11.8  TRUE
## 104 0.8348624 0.7251613   13.0    76.8  12328     31    4.2    5.9  TRUE
## 105 1.0716667 0.4023973   12.9    73.4   5327     58   28.3    6.1  TRUE
## 106 0.9448010 0.8811275   12.5    64.5  16646    170   44.2    9.5  TRUE
## 107 0.9689441 0.8506787   11.9    71.6   5223     21   29.3   20.8  TRUE
## 108 0.7244224 0.3168449   13.5    71.1  10512     45   43.0    2.2  TRUE
## 109        NA 0.6098830   10.8    65.6  13066     61   18.0   25.8 FALSE
## 110 1.4930748 0.8593272   12.5    64.4  16367    240  103.0   16.2  TRUE
## 111 0.8109756 0.6104513   13.0    68.9   9788    190   48.3   17.1  TRUE
## 112 0.8558140 0.6568396   11.9    72.9   7643    110   67.0   16.8  TRUE
## 113 0.9074074 0.2319277   13.0    72.9   4699     NA   45.8     NA FALSE
## 114        NA 0.6362434   11.5    68.4   5567     36   38.8   16.4 FALSE
## 115 1.0345369 0.6411543   11.3    68.2   7915    120   46.8   27.1  TRUE
## 116 0.8440367 0.6050633   12.3    73.0   7349     69   76.0   27.4  TRUE
## 117 0.9578393 0.7355372   13.6    57.4  12122    140   50.9   40.7  TRUE
## 118 0.8342697 0.8880779   11.9    75.8   5092     49   29.0   24.3  TRUE
## 119 0.8054146 0.7935723   13.2    68.3   5760    200   71.9   51.8  TRUE
## 120 0.9762397 0.7044025   12.5    70.6   3044     75   29.3   23.3  TRUE
## 121 0.5537849 0.2134670   10.1    69.4  14003     67   68.7   26.5  TRUE
## 122        NA 0.6152927   13.5    73.3   6094     53   70.6   20.8 FALSE
## 123        NA        NA   11.7    69.1   3432     96   18.6    0.0 FALSE
## 124 1.2615063 0.5291925   10.3    66.4   6522    250   88.5   31.3  TRUE
## 125 1.0287206 0.5902864   11.5    74.9   4457    100  100.8   39.1  TRUE
## 126 0.6854305 0.3496042   11.6    74.0   6850    120   35.8   11.0  TRUE
## 127 0.9680233 0.8587127   11.3    64.8   9418    130   54.9   37.7  TRUE
## 128 0.9439655 0.5589569   10.7    71.8   6929    140   97.2   13.3  TRUE
## 129 1.0427632 0.7639429   11.2    69.4   2517     44   42.8   15.2  TRUE
## 130 0.4770318 0.3379224   11.7    68.0   5497    190   32.8   12.2  TRUE
## 131 1.0852713 0.5162847   11.1    73.1   3938    120   84.0   25.8  TRUE
## 132 0.9855072 0.8639896   12.6    69.5   7176    120   40.9    8.3  TRUE
## 133        NA 0.4842520   11.7    68.2   5363    270   52.2   38.5 FALSE
## 134 0.7283951 0.1856946   12.3    69.6   2728     49   41.6   12.4  TRUE
## 135        NA 0.7687500   10.6    71.9   2803     86   44.8    0.0 FALSE
## 136 0.8446809 0.9383562   11.1    62.3   6012    410  126.7   11.5  TRUE
## 137        NA        NA   12.3    66.0   2434    130   16.6    8.7 FALSE
## 138        NA 0.8752711    9.0    57.6  21056    290  112.6   19.7 FALSE
## 139 0.5863636 0.8539720   13.5    60.1   3734    280  125.4   12.7  TRUE
## 140 0.6986090 0.9425770   11.5    61.4   3852    380   58.4   10.9  TRUE
## 141 0.6189189 0.9646018   10.6    66.2   4680     NA   65.0   25.0 FALSE
## 142 0.8256659 0.6825208   10.0    71.6   3191    170   80.6   20.0  TRUE
## 143 0.4323144 0.9109827   10.9    68.4   2949    170   44.3   19.0  TRUE
## 144        NA 0.5822622   11.3    66.5   2918    210   65.1   18.2 FALSE
## 145 0.8057325 0.8591160   11.0    61.6   2762    400   93.6   20.8  TRUE
## 146 0.4633508 0.9173364   12.4    69.6   2311    190   73.7   29.5  TRUE
## 147 0.4186551 0.2967431    7.8    66.2   4866    170   27.3   19.7  TRUE
## 148 1.4967320 0.9137303    8.6    65.9   4608    200   12.1    4.7  TRUE
## 149        NA 0.8231469   11.4    52.3   6822    460  170.2   36.8 FALSE
## 150 0.8423077 0.6131285   11.3    49.0   5542    310   72.0   14.7  TRUE
## 151 0.5894737 0.9767184    9.2    65.0   2411    410  122.7   36.0  TRUE
## 152        NA 0.7566719    9.0    52.8   5341    560  119.6    6.6 FALSE
## 153 0.6103152 0.8307292   10.4    55.5   2803    590  115.8   27.1  TRUE
## 154        NA 0.9569061   10.3    65.1   1328    440  122.8   20.5 FALSE
## 155 0.7854839 0.9275362   10.9    57.5   1615    470   60.3   35.1  TRUE
## 156 0.3971292 0.3628319    8.5    63.1   3560    320   73.3   22.2  TRUE
## 157        NA 0.6759494    9.2    67.9   1540    130   64.9    2.0 FALSE
## 158 0.5241379 0.9527027    9.9    62.6   2463    220   62.1    2.7  TRUE
## 159        NA 0.4394507   11.5    63.3   1456    350   51.1    3.0 FALSE
## 160 0.3220974 0.3518006    9.2    63.8   3519    270   47.0    0.7  TRUE
## 161 1.1526316 0.8027211   11.1    49.8   3306    490   89.4   26.8  TRUE
## 162 0.3995037 0.9913899   12.2    59.7   1228    450   91.5   17.6  TRUE
## 163 0.6363636 0.8577465    8.7    62.8   1669    380   42.0    3.5  TRUE
## 164 0.9090909 1.0128957   10.3    64.2   1458    320   33.6   57.5  TRUE
## 165 0.6835821 0.9570707    9.8    58.5   1613    360  126.6   35.0  TRUE
## 166 0.4185185 0.8633461   11.1    59.6   1767    340   90.2    8.4  TRUE
## 167 0.6648352 0.4118421    7.0    63.5   3809    360   84.0   23.8  TRUE
## 168        NA 0.5361891    6.4    62.0   3276    230   18.6   12.7 FALSE
## 169        NA        NA    7.6    55.7   2332    730   75.3   24.3 FALSE
## 170 0.4675325 0.7500000    7.9    66.5   2188    320   94.4   42.7  TRUE
## 171 0.1979866 0.1987421    9.3    60.4   1885    400   86.8   27.6  TRUE
## 172 0.4651163 0.6437346    8.9    51.5   3171    720  130.3    9.2  TRUE
## 173 0.5138889 1.0380368   10.8    62.8    747    510  144.8   16.7  TRUE
## 174 0.4285714 0.8756999    8.5    64.1   1428    420   78.4   25.5  TRUE
## 175 0.5523810 0.8709288    8.8    60.2   1507    430  115.8    9.4  TRUE
## 176 0.3950617 0.9658470    9.8    58.7    680    730  135.3    8.2  TRUE
## 177 0.3918575 0.8981481    9.5    60.9    805    640  117.4   10.7  TRUE
## 178        NA 0.8687898    9.0    55.2   1362    560   99.3   13.7 FALSE
## 179 0.5099338 0.6240786    8.4    58.0   1583    550  175.6    9.5  TRUE
## 180 0.2258065 1.0326087    9.3    55.1   1123    480  137.8   39.6  TRUE
## 181 0.4608295 0.9521739    8.6    50.9   1780   1100  100.7   12.4  TRUE
## 182        NA 0.8378033    8.7    58.8   1096    650  131.0   21.9 FALSE
## 183 0.2812500 0.8566667    7.8    58.7   1591    400  115.4   13.3  TRUE
## 184 0.6385542 1.0158537   10.1    56.7    758    740   30.3   34.9  TRUE
## 185 0.1717172 0.8080808    7.4    51.6   2085    980  152.0   14.9  TRUE
## 186        NA 0.8908686    4.1    63.7   1130    380   65.3   22.0 FALSE
## 187 0.3782772 0.8531140    7.2    50.7    581    880   98.3   12.5  TRUE
## 188 0.3076923 0.4459309    5.4    61.4    908    630  204.8   13.3  TRUE
## 189 0.7289916 0.3081009   12.0    70.6  15722    155   45.4   14.0  TRUE
## 190 0.8250377 0.7884131   12.7    74.0  11449     72   21.2   18.7  TRUE
## 191 0.8784119 0.6514286   13.6    72.3  12791     28   30.8   19.0  TRUE
## 192 0.9836957 0.6729323   14.0    75.0  14242     85   68.3   27.0  TRUE
## 193 0.5329670 0.3711083   11.2    68.4   5605    183   38.7   17.5  TRUE
## 194 0.7015873 0.8537859    9.6    58.5   3363    506  109.7   22.5  TRUE
## 195 0.8333333 0.6558018   12.2    71.5  14301    210   47.4   21.8  TRUE
## [1] 155   8

Summarizing the variables

Following is the summary of the current variables in the human data.

##     secedfm           labfm            eduexp         lifeexp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni             matmor           adobir           fparli     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Now let’s visualize these variables.

varlabels = c("female to male secondary educ. ratio","female to male labor force ratio","expected years of schooling","life expectancy at birth","gross national income","maternal mortality","adolescent birth rate","percentage of women in parliament","","")


selcols <- colnames(human)
selvars <- dplyr::select(human, one_of(selcols))
a=0
typeof(human)
## [1] "list"
for(var in selvars) {
   
 # if(is.numeric(var)){
  #  a = a +1
 #   plot <-ggplot(human, aes(var, ..count..)) + geom_bar() + xlab(varlabels[a]) + ylab("")
#  print(plot)

#} else { 
    a = a +1
    plot <- qplot(var,
      geom="histogram",
      binwidth = ((1/5.4) * sd(var)) * 3.49,  
      main = paste("Histogram for", varlabels[a])  ,  
      xlab = (varlabels[a]),
      ylab = ("Number of countries"),
      fill=I("blue"), 
      col=I("red"), 
      alpha=I(.2))
    print(plot)
 #   }
}

We can see that in most cases, cpuntries had a lower proportion of wpomenn with secondary education when compared with men. However there were instances where higher proportion of women had secondary educaiton compared to men. Upon closer inspection, it was Gabon, Myanmar, Guyana. In most countries there were about 20 to 30 percent less women than men in the labor force. 13 to 15 years of schooling was common in most countries. Life expectancy at birth in most countries was about 75 years. GNI per capita was left skewed.Maternal mortality was also left skewed but very high in some countries, for example upto 1100 maternal deaths per 100,000 live births in Sierra Leone. Percentage of women variable was normally ditributed with a peak at 20 percent.

Let’s check the corellations between these variables.

cor_matrix<-cor(human) 

corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Strong negative associations were between 1. maternal mortality and life expectancy at birth 2. exposure to education and maternal mortality 3. Adolescent borth rate and life expectancy. Strong positive corellations were between 1. Adolescent borth rate and maternal mortality 2. Exposure to education and life expectancy.

Principal componenet analysis

pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

### Principal componenet analysis (after standardization)

human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.5, 1), cex.axis = 0.7, col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA biplot ", cex.sub = 0.8, sub = "after standardizing the variables, GNI is no more the strongest contributing variable in the PCA anlysis. This is because the scaling makes it comparable in size to other variables.")

?biplot
## starting httpd help server ... done

The results are very different after standardization because standardization changes the scale of the data. Before, one variable (i.e. GNI) had a lot of variability and it was being given the highest weight because of that. PC1 was based entirely on variability within gni. After standardization, other variables could also influence the results.

Interpretation

According to my interpretation, PCA is the strongest principal component (i.e. catches the highest amount of variability, 53%). It is influenced by maternal mortality, adolescent birth rate in one direction and education exposure, life expectancy, secondary education ratio between men and women, and gni in the other direction.

The second pricipal component captures 16 percent variability in the data. It is influenced by number of women in the parliament and the participation of women in labor force.

Tea dataset

data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The tea dataset is a part of factominer package and contains information about tea drinking preferences of 300 individuals.

Following is a distribution of participants according to their age and gender.

  age <- tea$age
      plot <- qplot(age,
      geom="histogram",
      binwidth = ((1/6.69) * sd(var)) * 3.49,  
      main = paste("Age and sex distribution of participants")  ,  
      xlab = ("Age"),  
      fill=tea$sex, 
      col=I("red"), 
      alpha=I(.2)) + scale_fill_discrete(name = "Gender") 
    print(plot) 

The following graphs describe the characteristics of the individuals in terms of key varaibale groups. First we explore the time preference for cosuming tea, i.e. lunch, evenign or dinner. IN the second group we explore the location where they consume tea usually and from where they purchase it. In the third group we explore how the tea is consumed.

selcols <- c("evening", "dinner", "lunch")
group1 <-select(tea, one_of(selcols))
gather(group1) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

selcols <- c("home", "pub", "work", "tearoom", "where")
group2 <-select(tea, one_of(selcols))
gather(group2) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

selcols <- c("How", "sugar", "how")
group2 <-select(tea, one_of(selcols))
gather(group2) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Multiple correspondence analysis

I chose certain groups for the purpose of analyses.

#c("Tea" ,"evening", "dinner", "lunch", "sugar", "where", "home", "pub", "work", "tearoom" , "how","How", "sugar")

grp1 <- c("Tea" ,"evening", "dinner", "lunch")
grp2 <- c("Tea", "sugar", "how")
grp <- c()

a=0

for (num in 1:2) {
a=a+1

curgrp <- get(paste('grp',a, sep = ''))
curgrp
tea_mca <- select(tea, one_of(curgrp))

#gather(tea_mca) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

mca <- MCA(tea_mca, graph = FALSE)

p <- plot(mca, invisible=c("ind"), habillage = "quali")
print(p)

}

  1. I first wanted to investigate is the time of consuming tea is related to the type of tea consumed. I found that green tea is usually consumed for dinner. Earl grey and black are usually not consumed for dinner. People were most likely to not consume tea with lunch.

  2. In the second group of variables, I found that Earl Greay is usually consumed in tea bag for and with sugar. Green tea is bought both packaged and unpackaged and black tea is mostly bought unpackaged Sygar is susually not added to green or black tea.